In B1700 you started to learn the basics of R, and in the previous B1701 practical you learned how to load multiple files at once. However, there are occasions when your data are not stored in flat files and you may want to pull data from an online database or website. How to do this without predefined R packages is beyond the aims of this course; however, many R packages are available that can help you pull data from the web. Examples are worldfootballR, baseballr, hoopR, SwimmeR and StatsBombR. All these packages come with instructions on how to use them to pull relevant data from a variety of sources, and it is worth having a look at some of them. For this practical we will use the worldfootballR package. worldfootballR pulls football data from FBref, Transfermarkt, Understat and FotMob. We will focus on FBref data today, but it is worth exploring the data pulled from the other websites.
To start, install the worldfootballR package via install.packages("worldfootballR").
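Running install.packages() again when a package is already present is unnecessary, so a small check can be useful. The sketch below is one possible way to do this; is_installed is a made-up helper name, while requireNamespace() is base R. The install line is left commented out so that nothing is downloaded by accident.

```r
# Hypothetical helper: TRUE when a package can be loaded, FALSE otherwise.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# Report the packages that are missing.
for (pkg in c("worldfootballR", "tidyverse")) {
  if (!is_installed(pkg)) {
    message("Not installed yet: ", pkg)
    # install.packages(pkg)  # uncomment to install the missing package
  }
}
```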
Next we need to load this package as well as tidyverse.
# Load packages
library(worldfootballR)
library(tidyverse)
Once you have successfully installed and loaded all the necessary packages, you can begin reading your data.
worldfootballR uses several functions to load data into R. You can find a detailed explanation of all functions in the package documentation. Some of the main functions we will use today are:
fb_match_urls() is used to get the match URLs for the chosen league and season.
fb_advanced_match_stats() is used to extract player or team stats for each match in the relevant league. You can extract summary, passing, passing types, defensive, possession, miscellaneous and goalkeeping stats.
First up we want to get an overview of the team performance data. We will focus on season 2022 to 2023 and 1st tier men’s competitions in Spain (La Liga).
We will first get the match URLs for all La Liga matches in 2022/2023. Once we have these, we will use them to load the different statistics available for each team, and finally we will merge the individual statistics tables into one combined table. Note that the code below scrapes all the data off the internet and will take a while; patience is your friend.
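The merging step can be tried on small made-up tables before any scraping is done. The sketch below uses invented column names (Team, MatchDate, Goals, PassesCompleted), not the actual FBref columns, to show what merge() with all = TRUE does when one table has a row the other lacks.

```r
# Two small made-up tables that share the columns Team and MatchDate.
summary_df <- data.frame(
  Team = c("A", "B"),
  MatchDate = c("2023-01-01", "2023-01-01"),
  Goals = c(2, 1)
)
passing_df <- data.frame(
  Team = c("A", "B", "C"),
  MatchDate = c("2023-01-01", "2023-01-01", "2023-01-02"),
  PassesCompleted = c(410, 388, 295)
)

# With no `by` argument, merge() joins on every column name the tables share.
# all = TRUE keeps rows found in only one table and fills the gaps with NA.
combined_df <- merge(summary_df, passing_df, all = TRUE)
print(combined_df)
```

Team C appears only in the passing table, so its Goals value is NA in the combined table. Without all = TRUE that row would be dropped.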
# Get match URLs (season_end_year is the year in which the season ends, so 2023 for 2022/2023)
MatchUrls <- fb_match_urls(country = "ESP", gender = "M", season_end_year = 2023, tier = "1st")
# Get different team stats
TeamStats2023SumDF <- fb_advanced_match_stats(match_url = MatchUrls, stat_type = "summary", team_or_player = "team")
TeamStats2023PassDF <- fb_advanced_match_stats(match_url = MatchUrls, stat_type = "passing", team_or_player = "team")
TeamStats2023PassTDF <- fb_advanced_match_stats(match_url = MatchUrls, stat_type = "passing_types", team_or_player = "team")
TeamStats2023DefDF <- fb_advanced_match_stats(match_url = MatchUrls, stat_type = "defense", team_or_player = "team")
TeamStats2023PosDF <- fb_advanced_match_stats(match_url = MatchUrls, stat_type = "possession", team_or_player = "team")
TeamStats2023MiscDF <- fb_advanced_match_stats(match_url = MatchUrls, stat_type = "misc", team_or_player = "team")
TeamStats2023KeeperDF <- fb_advanced_match_stats(match_url = MatchUrls, stat_type = "keeper", team_or_player = "team")
# Combine the team stats data into one DF
CombinedTeamDataDF <- merge(TeamStats2023SumDF, TeamStats2023PassDF, all = TRUE)
CombinedTeamDataDF <- CombinedTeamDataDF %>%
  merge(TeamStats2023PassTDF, all = TRUE) %>%
  merge(TeamStats2023DefDF, all = TRUE) %>%
  merge(TeamStats2023PosDF, all = TRUE) %>%
  merge(TeamStats2023MiscDF, all = TRUE) %>%
  merge(TeamStats2023KeeperDF, all = TRUE)
Next we want to get an overview of the player data. We will again focus on the 2022/2023 season and 1st tier men’s competitions in Spain.
We will reuse the match URLs from the previous step to load the different statistics available for each player, and finally we will merge the individual statistics tables into one combined table. Note that the code below scrapes all the data off the internet and will take a while; patience is your friend.
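The calls for the different statistics differ only in stat_type, so they can also be written as a loop. The sketch below is one possible approach, not the method used in this practical. scrape_all_stats and fake_fetch are made-up names, and the scraping function is passed in as the argument fetch so that the logic can be tried offline with a stand-in.

```r
stat_types <- c("summary", "passing", "passing_types", "defense",
                "possession", "misc", "keeper")

# Hypothetical helper: call `fetch` once per stat type, then merge the results.
scrape_all_stats <- function(urls, team_or_player, fetch) {
  tables <- lapply(stat_types, function(st) {
    fetch(match_url = urls, stat_type = st, team_or_player = team_or_player)
  })
  # Merge the tables one after another, keeping all rows each time.
  Reduce(function(x, y) merge(x, y, all = TRUE), tables)
}

# Offline stand-in for the scraper: returns one made-up row per URL,
# with a single column named after the requested stat type.
fake_fetch <- function(match_url, stat_type, team_or_player) {
  out <- data.frame(MatchURL = match_url)
  out[[stat_type]] <- seq_along(match_url)
  out
}

demo_df <- scrape_all_stats(c("url-1", "url-2"), "team", fetch = fake_fetch)
print(names(demo_df))
```

With worldfootballR loaded, fb_advanced_match_stats could be passed as fetch, for example scrape_all_stats(MatchUrls, "player", fetch = fb_advanced_match_stats). The scraping itself would take just as long as the separate calls.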
# Individual player stats
PlayerStats2023SumDF <- fb_advanced_match_stats(match_url = MatchUrls, stat_type = "summary", team_or_player = "player")
PlayerStats2023PassDF <- fb_advanced_match_stats(match_url = MatchUrls, stat_type = "passing", team_or_player = "player")
PlayerStats2023PassTypesDF <- fb_advanced_match_stats(match_url = MatchUrls, stat_type = "passing_types", team_or_player = "player")
PlayerStats2023DefDF <- fb_advanced_match_stats(match_url = MatchUrls, stat_type = "defense", team_or_player = "player")
PlayerStats2023PosDF <- fb_advanced_match_stats(match_url = MatchUrls, stat_type = "possession", team_or_player = "player")
PlayerStats2023MiscDF <- fb_advanced_match_stats(match_url = MatchUrls, stat_type = "misc", team_or_player = "player")
PlayerStats2023KeeperDF <- fb_advanced_match_stats(match_url = MatchUrls, stat_type = "keeper", team_or_player = "player")
# Combine individual player stats
CombinedPlayerDataDF <- merge(PlayerStats2023SumDF, PlayerStats2023PassDF, all = TRUE)
CombinedPlayerDataDF <- CombinedPlayerDataDF %>%
  merge(PlayerStats2023PassTypesDF, all = TRUE) %>%
  merge(PlayerStats2023DefDF, all = TRUE) %>%
  merge(PlayerStats2023PosDF, all = TRUE) %>%
  merge(PlayerStats2023MiscDF, all = TRUE) %>%
  merge(PlayerStats2023KeeperDF, all = TRUE)
So far we have focussed on domestic competitions. If we want to extract data from non-domestic competitions (e.g. the Champions League or World Cups), we need slightly different code. Instead of using the country, we need to locate the relevant URL for the competition. You can do so by going to https://fbref.com/en/comps/, clicking on the relevant competition and copying the URL. For example, if I were after Champions League data I would copy the following URL: https://fbref.com/en/comps/8/history/Champions-League-Seasons. Now let’s see if we can get the match URLs and the team and player summary statistics for the 2022 men’s World Cup.
MatchUrls <- fb_match_urls(country = "", gender = "M", season_end_year = 2022, non_dom_league_url = "https://fbref.com/en/comps/1/history/World-Cup-Seasons")
WorldCupTeamDF <- fb_advanced_match_stats(match_url = MatchUrls, stat_type = "summary", team_or_player = "team")
WorldCupPlayerDF <- fb_advanced_match_stats(match_url = MatchUrls, stat_type = "summary", team_or_player = "player")
Now that we have created four tables with player and team stats for two different competitions, we can save these as .rds files. Doing this immediately means we do not have to go through the tedious process of scraping all the data off the internet again.
saveRDS(CombinedTeamDataDF, file="C:/Users/wkb14101/OneDrive - University of Strathclyde/MSc SDA/R Projects/B1701/data/Saved Data/CombinedTeamData.rds")
saveRDS(CombinedPlayerDataDF, file="C:/Users/wkb14101/OneDrive - University of Strathclyde/MSc SDA/R Projects/B1701/data/Saved Data/CombinedPlayerData.rds")
saveRDS(WorldCupTeamDF, file="C:/Users/wkb14101/OneDrive - University of Strathclyde/MSc SDA/R Projects/B1701/data/Saved Data/WorldCupTeamData.rds")
saveRDS(WorldCupPlayerDF, file="C:/Users/wkb14101/OneDrive - University of Strathclyde/MSc SDA/R Projects/B1701/data/Saved Data/WorldCupPlayerData.rds")
Exercise 1: Make sure worldfootballR and tidyverse are installed and loaded.
# Load packages
library(worldfootballR)
library(tidyverse)
Exercise 2: Load all summary and possession team data for the 2018 National Women’s Soccer League (USA).
MatchUrls <- fb_match_urls(country = "USA", gender = "F", season_end_year = 2018, tier="1st")
NWSLTeamSummaryDF <- fb_advanced_match_stats(match_url = MatchUrls, stat_type = "summary", team_or_player = "team")
NWSLTeamPossessionDF <- fb_advanced_match_stats(match_url = MatchUrls, stat_type = "possession", team_or_player = "team")
Exercise 3: Merge your two data frames together and call the result NWSLTeamDF.
NWSLTeamDF <- merge(NWSLTeamSummaryDF, NWSLTeamPossessionDF)
Exercise 4: Save your data file using saveRDS().
saveRDS(NWSLTeamDF, "C:/Users/wkb14101/OneDrive - University of Strathclyde/MSc SDA/R Projects/B1701/data/Saved Data/NWSLTeamData.rds")
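A saved .rds file can be read back in a later session with readRDS(), which is base R. The sketch below uses a temporary file and a made-up data frame so that it runs on any machine; in practice the path would be the one given to saveRDS().

```r
# Made-up stand-in for a scraped table.
example_df <- data.frame(Team = c("A", "B"), Goals = c(2, 1))

# Write the table to a temporary .rds file, then read it back.
rds_path <- tempfile(fileext = ".rds")
saveRDS(example_df, file = rds_path)
restored_df <- readRDS(rds_path)

# Check that the restored table matches the original.
print(all.equal(example_df, restored_df))
```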